133 research outputs found

    Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms

    Clustering non-Euclidean data is difficult, and one of the most widely used algorithms besides hierarchical clustering is Partitioning Around Medoids (PAM), also simply referred to as k-medoids. In Euclidean geometry the mean, as used in k-means, is a good estimator for the cluster center, but this does not hold for arbitrary dissimilarities. PAM uses the medoid instead: the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity and is thus highly relevant to many domains, such as biology, that require Jaccard, Gower, or more complex distances. A key issue with PAM is its high run time cost. We propose modifications to the PAM algorithm that achieve an O(k)-fold speedup in the second SWAP phase of the algorithm, yet still find the same results as the original PAM algorithm. If we slightly relax the choice of swaps performed (at comparable quality), we can further accelerate the algorithm by performing up to k swaps in each iteration. With the substantially faster SWAP, we can also explore alternative strategies for choosing the initial medoids. We also show how the CLARA and CLARANS algorithms benefit from these modifications. Our approach can easily be combined with earlier approaches for using PAM and CLARA on big data (some of which use PAM as a subroutine and hence immediately benefit from these improvements), where performance at high k becomes increasingly important. In experiments on real data with k=100, we observed a 200-fold speedup compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets as long as we can afford to compute a distance matrix, and in particular to higher k (at k=2, the new SWAP was only 1.5 times faster, as the speedup is expected to increase with k).
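
    As a concrete illustration of the medoid notion this abstract builds on, here is a minimal Python sketch (our own illustration, not the paper's optimized SWAP): the medoid of a cluster is the member with the smallest total dissimilarity to all other members, computed from a precomputed dissimilarity matrix. The function name and toy matrix below are hypothetical.

```python
import numpy as np

def medoid(dissim: np.ndarray, members: list[int]) -> int:
    """Return the index (into the full data set) of the cluster medoid.

    dissim  -- full n x n dissimilarity matrix (any dissimilarity, e.g. Jaccard or Gower)
    members -- indices of the objects belonging to this cluster
    """
    sub = dissim[np.ix_(members, members)]   # pairwise dissimilarities within the cluster
    totals = sub.sum(axis=1)                 # total dissimilarity of each member to the rest
    return members[int(np.argmin(totals))]   # the minimizer is the medoid

# Toy example: object 1 has the smallest row sum, so it is the medoid.
D = np.array([[0.0, 0.2, 0.9],
              [0.2, 0.0, 0.8],
              [0.9, 0.8, 0.0]])
print(medoid(D, [0, 1, 2]))  # -> 1
```

    The SWAP phase of PAM repeatedly evaluates replacing one of the k current medoids with a non-medoid object; the speedup described in the abstract comes from removing an O(k) factor from that evaluation.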

    Run Generation Revisited: What Goes Up May or May Not Come Down

    In this paper, we revisit the classic problem of run generation. Run generation is the first phase of external-memory sorting, where the objective is to scan through the data, reorder elements using a small buffer of size M, and output runs (contiguously sorted chunks of elements) that are as long as possible. We develop algorithms for minimizing the total number of runs (or equivalently, maximizing the average run length) when the runs are allowed to be sorted or reverse sorted. We study the problem in the online setting, both with and without resource augmentation, and in the offline setting. (1) We analyze alternating-up-down replacement selection (runs alternate between sorted and reverse sorted), which was studied by Knuth as far back as 1963. We show that this simple policy is asymptotically optimal. Specifically, we show that alternating-up-down replacement selection is 2-competitive and that no deterministic online algorithm can perform better. (2) We give online algorithms having smaller competitive ratios with resource augmentation. Specifically, we exhibit a deterministic algorithm that, when given a buffer of size 4M, is able to match or beat any optimal algorithm having a buffer of size M. Furthermore, we present a randomized online algorithm which is 7/4-competitive when given a buffer twice the size of the optimal algorithm's. (3) We demonstrate that performance can also be improved with a small amount of foresight. We give an algorithm, which is 3/2-competitive, with foreknowledge of the next 3M elements of the input stream. For the extreme case where all future elements are known, we design a PTAS for computing the optimal strategy a run generation algorithm must follow. (4) Finally, we present algorithms tailored for nearly sorted inputs which are guaranteed to have optimal solutions with sufficiently long runs.
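
    As background for the problem setting, here is a minimal Python sketch of classic ascending-only replacement selection with a buffer of M elements, the baseline that the alternating-up-down policy analyzed in the paper generalizes; the function name and interface are our own, not from the paper.

```python
import heapq

def replacement_selection(stream, M):
    """Classic ascending-only replacement selection with a buffer of M elements.
    Yields sorted runs; on random input the expected run length is about 2M (Knuth)."""
    data = list(stream)
    buf = data[:M]
    heapq.heapify(buf)
    rest = iter(data[M:])
    frozen = []                              # elements too small for the current run; postponed
    run = []
    while buf:
        x = heapq.heappop(buf)
        run.append(x)                        # emit the smallest buffered element
        nxt = next(rest, None)
        if nxt is not None:
            if nxt >= x:
                heapq.heappush(buf, nxt)     # still fits into the current ascending run
            else:
                frozen.append(nxt)           # must wait for the next run
        if not buf:                          # current run is finished; start the next
            yield run
            run = []
            buf = frozen
            heapq.heapify(buf)
            frozen = []

# Usage: for run in replacement_selection(data, M): write_run_to_disk(run)
```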

    Quantifying Privacy: A Novel Entropy-Based Measure of Disclosure Risk

    It is well recognised that data mining and statistical analysis pose a serious threat to privacy. This is true for financial, medical, criminal, and marketing research. Numerous techniques have been proposed to protect privacy, including restriction and data modification. Recently proposed privacy models such as differential privacy and k-anonymity have received a lot of attention, and for the latter there are now several improvements of the original scheme, each removing some security shortcomings of the previous one. However, the challenge lies in evaluating and comparing the privacy provided by various techniques. In this paper we propose a novel entropy-based security measure that can be applied to any generalisation, restriction, or data modification technique. We use our measure to empirically evaluate and compare a few popular methods, namely query restriction, sampling, and noise addition.
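
    To make the idea of an entropy-based disclosure-risk measure concrete, here is a minimal sketch (our own illustration; the paper's actual measure is defined there) computing the Shannon entropy of the sensitive values within a released group: the lower the entropy, the less uncertainty an attacker has, and therefore the higher the disclosure risk.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`.
    Lower entropy within a released group means less attacker uncertainty,
    i.e. a higher disclosure risk."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A group with varied sensitive values leaves the attacker uncertain:
print(shannon_entropy(["flu", "cold", "hiv", "flu"]))   # 1.5 bits
# A homogeneous group leaks everything even if it is k-anonymous:
print(shannon_entropy(["hiv", "hiv", "hiv", "hiv"]))    # 0.0 bits
```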

    BETULA: Numerically Stable CF-Trees for BIRCH Clustering

    BIRCH clustering is a widely known clustering approach that has influenced much subsequent research and many commercial products. The key contribution of BIRCH is the Clustering Feature tree (CF-Tree), a compressed representation of the input data. As new data arrives, the tree is eventually rebuilt to increase the compression. Afterward, the leaves of the tree are used for clustering. Because of the data compression, this method is very scalable. The idea has been adopted, for example, for k-means, data stream, and density-based clustering. The clustering features used by BIRCH are simple summary statistics that can easily be updated with new data: the number of points, the linear sums, and the sum of squared values. Unfortunately, the way the sum of squares is then used in BIRCH is prone to catastrophic cancellation. We introduce a replacement cluster feature that does not have this numeric problem, is not much more expensive to maintain, and makes many computations simpler and hence more efficient. These cluster features can also easily be used in other work derived from BIRCH, such as algorithms for streaming data. In the experiments, we demonstrate the numerical problem and compare the performance of the original algorithm to that of the improved cluster features.
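
    The numerical problem the abstract refers to is easy to reproduce. The sketch below (an illustration using the textbook BIRCH formulas, not BETULA's exact cluster feature) derives the variance from the clustering feature (N, linear sum, sum of squares) and contrasts it with a mean-centered computation.

```python
import numpy as np

x = 1e8 + np.array([0.0, 0.1, 0.2])        # large values, tiny spread

N  = len(x)
LS = x.sum()                               # linear sum, as stored in a BIRCH clustering feature
SS = (x * x).sum()                         # sum of squares, as stored in a BIRCH clustering feature

var_birch = SS / N - (LS / N) ** 2         # textbook CF variance: two ~1e16 terms cancel
var_true  = np.var(x)                      # two-pass, mean-centered computation

print(var_birch)   # garbage: bears no relation to the true variance, may even be negative
print(var_true)    # ~0.00667
```

    Storing the mean and the sum of squared deviations from it, rather than the raw sum of squares, is a standard way to avoid subtracting two nearly equal large numbers.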

    On the Recognition of Four-Directional Orthogonal Ray Graphs

    Orthogonal ray graphs are the intersection graphs of horizontal and vertical rays (i.e. half-lines) in the plane. If the rays can have any possible orientation (left/right/up/down) then the graph is a 4-directional orthogonal ray graph (4-DORG). Otherwise, if all rays point only in the positive x and y directions, the intersection graph is a 2-DORG. Similarly, for 3-DORGs, the horizontal rays can have any direction but the vertical ones can only have the positive direction. The recognition problem of 2-DORGs, which are a nice subclass of bipartite comparability graphs, is known to be polynomial, while the recognition problems for 3-DORGs and 4-DORGs are open. Recently it has been shown that the recognition of unit grid intersection graphs, a superclass of 4-DORGs, is NP-complete. In this paper we prove that the recognition problem of 4-DORGs is polynomial, given a partition {L,R,U,D} of the vertices of G (which corresponds to the four possible ray directions). For the proof, given the graph G, we first construct two cliques G_1, G_2 with both directed and undirected edges. Then we successively augment these two graphs, eventually constructing a single graph with both directed and undirected edges, such that G has a 4-DORG representation if and only if this augmented graph has a transitive orientation respecting its directed edges. As a crucial tool for our analysis we introduce the notion of an S-orientation of a graph, which extends the notion of a transitive orientation. We expect that our proof ideas will also be useful in other situations. Using an independent approach we show that, given a permutation π of the vertices of U (π is the order of y-coordinates of the ray endpoints for U), while the partition {L,R} of V ∖ U is not given, we can still efficiently check whether G has a 3-DORG representation.
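
    As a small aside on the geometry involved (our own illustration, not part of the paper's recognition algorithm), testing whether a horizontal and a vertical ray intersect, which is what defines the edges of these graphs, is straightforward:

```python
# A ray is represented as (origin_x, origin_y, direction), direction in {"L", "R", "U", "D"}.

def rays_intersect(h, v):
    """h is a horizontal ray (direction L or R), v is a vertical ray (U or D)."""
    hx, hy, hd = h
    vx, vy, vd = v
    # The only candidate intersection point is (vx, hy).
    on_h = (vx >= hx) if hd == "R" else (vx <= hx)   # point lies on the horizontal ray
    on_v = (hy >= vy) if vd == "U" else (hy <= vy)   # point lies on the vertical ray
    return on_h and on_v

print(rays_intersect((0, 1, "R"), (2, 0, "U")))   # True: they cross at (2, 1)
print(rays_intersect((0, 1, "R"), (-1, 0, "U")))  # False: the vertical ray is to the left
```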

    Vertex Cover Kernelization Revisited: Upper and Lower Bounds for a Refined Parameter

    An important result in the study of polynomial-time preprocessing shows that there is an algorithm which, given an instance (G,k) of Vertex Cover, outputs an equivalent instance (G',k') in polynomial time with the guarantee that G' has at most 2k' vertices (and thus O((k')^2) edges) with k' <= k. Using the terminology of parameterized complexity we say that k-Vertex Cover has a kernel with 2k vertices. There is complexity-theoretic evidence that both 2k vertices and Theta(k^2) edges are optimal for the kernel size. In this paper we consider the Vertex Cover problem with a different parameter, the size fvs(G) of a minimum feedback vertex set for G. This refined parameter is structurally smaller than the parameter k associated with the vertex covering number vc(G), since fvs(G) <= vc(G) and the difference can be arbitrarily large. We give a kernel for Vertex Cover with a number of vertices that is cubic in fvs(G): an instance (G,X,k) of Vertex Cover, where X is a feedback vertex set for G, can be transformed in polynomial time into an equivalent instance (G',X',k') such that |V(G')| <= 2k and |V(G')| <= O(|X'|^3). A similar result holds when the feedback vertex set X is not given along with the input. In sharp contrast we show that the Weighted Vertex Cover problem does not have a polynomial kernel when parameterized by the cardinality of a given vertex cover of the graph unless NP is in coNP/poly and the polynomial hierarchy collapses to the third level.
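
    For readers new to kernelization, the sketch below shows the classic high-degree (Buss) reduction rule for Vertex Cover; it is only an illustration of what a kernel is, not the 2k-vertex kernel mentioned above and not the paper's kernel parameterized by the feedback vertex set.

```python
# Buss kernelization for Vertex Cover. Input: a symmetric adjacency structure
# (dict vertex -> set of neighbors) and the budget k.

def buss_kernel(adj, k):
    """Return an equivalent reduced instance (adj', k'), or None for a no-instance."""
    adj = {v: set(ns) for v, ns in adj.items()}
    changed = True
    while changed and k >= 0:
        changed = False
        for v in list(adj):
            if len(adj[v]) > k:              # v must be in every vertex cover of size <= k
                for u in adj[v]:
                    adj[u].discard(v)
                del adj[v]
                k -= 1
                changed = True
            elif not adj[v]:                 # isolated vertices never help a cover
                del adj[v]
                changed = True
    edges = sum(len(ns) for ns in adj.values()) // 2
    if k < 0 or edges > k * k:               # a reduced yes-instance has at most k^2 edges
        return None
    return adj, k
```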

    Query Processing in Spatial Databases Containing Obstacles

    Despite the existence of obstacles in many database applications, traditional spatial query processing assumes that points in space are directly reachable and utilizes the Euclidean distance metric. In this paper, we study spatial queries in the presence of obstacles, where the obstructed distance between two points is defined as the length of the shortest path that connects them without crossing any obstacles. We propose efficient algorithms for the most important query types, namely range search, nearest neighbours, e-distance joins, closest pairs, and distance semi-joins, assuming that both data objects and obstacles are indexed by R-trees. The effectiveness of the proposed solutions is verified through extensive experiments.
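
    As an illustration of the obstructed distance underlying all of these queries (our own sketch; the paper's algorithms work on R-tree-indexed data and obstacles), the shortest obstacle-avoiding path can be found by running Dijkstra over a visibility graph whose nodes are the two query points plus the obstacle corners, with edges between mutually visible nodes:

```python
import heapq
import math

def obstructed_distance(visibility_edges, p, q):
    """Dijkstra over a precomputed visibility graph.

    visibility_edges -- dict node -> list of (neighbor, edge_length)
    p, q             -- node ids of the two query points
    """
    dist = {p: 0.0}
    heap = [(0.0, p)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == q:
            return d                          # shortest obstacle-avoiding path length
        if d > dist.get(u, math.inf):
            continue                          # stale heap entry
        for v, w in visibility_edges.get(u, []):
            if d + w < dist.get(v, math.inf):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return math.inf                           # q is unreachable without crossing an obstacle
```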